COVID-19 Data Analysis

Purpose

I got off a call with a F1 Simulator Software Architect the other day. It was during our talk that Pandas was mentioned for processing data. I never heard of Pandas before, so I did my research after the call. I went to one of my full-time friends in the Vision Automation team at Tesla and asked him what is Pandas. He made the joke that it is like endangered species and then proceeded to tell me the actual software -- Pandas. He said it was a library in python that is used a lot to process dataset in Excel with a bunch of data. I was like SOLD! Anything to make my life easier and more learning! I had prior experience in python, Excel, and C# from UT Austin and Tesla so I was excited. I also want to code in Jupyter Notebook as well so used this project for it.

Thought Process

Overview

Research on YouTube about Pandas
Apply the knowledge by trying an example

Process Step

Found a dataset that I was interested in analyzing
Install Jupyter Notebook: have to open terminal on the Jupyter Notebook location
Install python (already have -- upgraded it)
Load dataframes for python to manipulate in Pandas

Read the csv data: rearranging the filtering the data

Data charts: install matplotlib using pip install matplotlib

Challenges

Plot not showing up

Needed to execute all the cells -- not just the plot code
Made sure that there isn't another code cell that is causing an issue -- restarted the kernel and ran the cell again
Isolate the plot code in a new Jupyter Notebook cell -- to isolate any interference from other cells.

Had new_dfP = (df.iloc[:,7:17] == 1) | (df.iloc[:,7:17] == 2) to filter the original dataframe, but should be a mask for selected rows that has values of either 1 "yes" or 2 "no".

Incorrect code: plt.bar(new_dfP.index, countY, label = 'Health Conditions', width = 0.4)
Fixed code: plt.bar(df.index, countY, label='Health Conditions', width=0.4)

Debug by

countY = df.iloc[:, 7:17].eq(1).sum(axis=1); 
										print(countY)

found out that the countY was adding a column of the total number of yes pre-existing health conditions

Needed the countY to be the total number of yes/no for the bar chart

Added the

totalCountY = countY.sum();
										print(totalCountY)

Fixed the

plt.bar(df.index, totalCountY, label='Health Conditions', width=0.4);
											plt.bar(df.index, totalCountN, bottom=totalCountY, label='No Health Conditions', width=0.4)

figured that the "bottom" parameter isn't needed so we can plot two bars side by side.

Skills Learned & Tips

Filtering the data learned the difference from loc vs. iloc

Messed up as countN = (df.iloc[:,7:17].eq(2).sum(axis=1)) instead had countN = (df.iloc[:,7:17].eq(2)) == 1 since wanted to sum all the yes and no health conditions
It is important when learning a new language to figure out the proper document for the plots
If install a new library or add-in, rerun and close the Jupyter Notebook because it won't work if so

Overall Thoughts

It was really fun learning a new library in python. It challenged me a lot since it has been a while since I had coded in python -- but it was quick for me to pick it up and think a way to opimtize the code faster instead of having too many iterations. I hope to spend more time learning about Pandas and its capabilities to process data a bit more since it has only been one day since I heard of the library. The bar chart wasn't showing up. I was up late trying to figure it out, but went to sleep and figured it out in the morning.

Moral of the story: “Pandas...Not just an animal. Sleep is the best debug. :D”